Conceptualization to Develop Machine Learning Techniques for Information Extraction: Consistency Queries
Abstract
Information extraction from documents is an increasingly urgent problem in enterprise knowledge management. Knowledge sources may be internal, such as text files and forms from business administration processes, or external, such as HTML pages. When the number of knowledge sources is very large, substantial computer support is indispensable, and machine learning techniques play a crucial role. A prototypical development system named LExIKON has been developed that supports interactive information extraction from semi-structured documents. The central mechanism inside LExIKON is the learning of formal languages. These formal languages serve as parameters of so-called wrappers, which are synthesized programs that perform the intended information extraction. The essence of the LExIKON technology and the functionality of the LExIKON development system are sketched by means of a sample session, documented and discussed using several screenshots. The automatic generation of (hypothetical) wrappers for information extraction through the invocation of machine learning techniques raises several questions. What can we expect of a generated wrapper that is not yet completely correct? Can we generate wrappers in a properly incremental fashion? To answer these practically relevant questions, a new formal framework of learning, learning by consistency queries, is introduced and studied. The overall scenario of learning by consistency queries for information extraction is formalized, and different constraints on the query learners are discussed and formulated. The aim of the paper is to demonstrate the value of theoretical conceptualization for the development and evaluation of application-oriented machine learning techniques, and to answer a few of the practically motivated questions.
Similar resources
Machine learning based Visual Evoked Potential (VEP) Signals Recognition
Introduction: Visual evoked potentials contain certain diagnostic information which have proved to be of importance in the visual systems functional integrity. Due to substantial decrease of amplitude in extra macular stimulation in commonly used pattern VEPs, differentiating normal and abnormal signals can prove to be quite an obstacle. Due to developments of use of machine l...
Using Machine Learning Algorithms for Automatic Cyber Bullying Detection in Arabic Social Media
Social media allows people interact to express their thoughts or feelings about different subjects. However, some of users may write offensive twits to other via social media which known as cyber bullying. Successful prevention depends on automatically detecting malicious messages. Automatic detection of bullying in the text of social media by analyzing the text "twits" via one of the machine l...
Interactive Learning of Node Selecting Tree Transducers
We develop new algorithms for learning monadic node selection queries in unranked trees from annotated examples, and apply them to visually interactive Web information extraction. We propose to represent monadic queries by bottom-up deterministic Node Selecting Tree Transducers (Nstts), a particular class of tree automata that we introduce. We prove that deterministic Nstts capture the class o...
Learning n-ary tree-pattern queries for web information extraction
The problem of extracting information from the Web consists in building patterns allowing to extract specific information from documents of a given Web source. Up to now, most existing techniques use string-based representations of documents as well as string-based patterns. Using tree representations naturally allows to overcome limitations of string-based approaches. While some tree-based app...
Conventional Machine Learning for Social Choice
Deciding the outcome of an election when voters have provided only partial orderings over their preferences requires voting rules that accommodate missing data. While existing techniques, including considerable recent work, address missingness through circumvention, we propose the novel application of conventional machine learning techniques to predict the missing components of ballots via late...
Publication date: 2002